Multiple Sequence Alignment for Morphology Induction
Abstract
MetaMorph is a novel application of multiple sequence alignment (MSA) to natural language morphology induction. Given a text corpus in any language, we sequentially align a subset of the words of the corpus to form an MSA using a probabilistic scoring scheme. We then segment the MSA to produce output analyses. We used this algorithm to compete in the 2009 Morpho Challenge. The F-measure of the analyses produced by MetaMorph is low for the full development corpus, but high for the corpus subsets used to generate the MSA, even surpassing the F-measure of another system used to aid MSA segmentation. This suggests that MSA is an effective algorithm for unsupervised morphology induction and may yet outperform the state-of-the-art morphology induction algorithms. Future research directions are discussed.
Similar resources
An Application of the ABS LX Algorithm to Multiple Sequence Alignment
We present an application of ABS algorithms for multiple sequence alignment (MSA). The Markov decision process (MDP) based model leads to a linear programming problem (LPP), whose solution is linked to a suggested alignment. The important features of our work include the facility of alignment of multiple sequences simultaneously and no limit for the length of the sequences. Our goal here is to ...
Morphological Analysis by Multiple Sequence Alignment
In biological sequence processing, Multiple Sequence Alignment (MSA) techniques capture information about long-distance dependencies and the three-dimensional structure of protein and nucleotide sequences without resorting to polynomial complexity context-free models. But MSA techniques have rarely been used in natural language (NL) processing, and never for NL morphology induction. Our MetaMor...
A Sort-based Algorithm for Multiple Sequence Alignment
We propose a sort-based algorithm for multiple sequence alignment using anchors. Anchors are determined by the use of suffix sorting along with position-based sorts. Potential anchor points are identified by a careful exploitation of the sorted suffixes obtained from a generalized suffix array of the input sequences. Final alignment is obtained by a recursive application of the suffix-sorting a...
A Bayesian Mixture Model for Part-of-Speech Induction Using Multiple Features
In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (co...
A Bayesian Mixture Model for PoS Induction Using Multiple Features
In this paper we present a fully unsupervised syntactic class induction system formulated as a Bayesian multinomial mixture model, where each word type is constrained to belong to a single class. By using a mixture model rather than a sequence model (e.g., HMM), we are able to easily add multiple kinds of features, including those at both the type level (morphology features) and token level (co...